Semi-Supervised Text Classification Using EM

Authors

  • Kamal Nigam
  • Andrew McCallum
  • Tom Mitchell
Abstract

For several decades, statisticians have advocated using a combination of labeled and unlabeled data to train classifiers by estimating parameters of a generative model through iterative Expectation-Maximization (EM) techniques. This chapter explores the effectiveness of this approach when applied to the domain of text classification. Text documents are represented here with a bag-of-words model, which leads to a generative classification model based on a mixture of multinomials. This model is an extremely simplistic representation of the complexities of written text. This chapter explains and illustrates three key points about semi-supervised learning for text classification with generative models. First, despite the simplistic representation, some text domains have a high positive correlation between generative model probability and classification accuracy. In these domains, a straightforward application of EM with the naive Bayes text model works well. Second, some text domains do not have this correlation. Here we can adopt a more expressive and appropriate generative model that does have a positive correlation. In these domains, semi-supervised learning again improves classification accuracy. Finally, EM suffers from the problem of local maxima, especially in high-dimensional domains such as text classification. We demonstrate that deterministic annealing, a variant of EM, can help overcome the problem of local maxima and increase classification accuracy further when the generative model is appropriate.
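The approach described above can be sketched in code. The following is a minimal, hedged illustration (not the authors' implementation): multinomial naive Bayes is initialized on the labeled documents, then EM alternates an E-step that assigns class posteriors to unlabeled documents and an M-step that re-estimates word and class probabilities from both sets. An optional temperature schedule implements deterministic annealing by flattening the E-step posteriors at high temperature; the function names, toy data, and the specific schedule are illustrative assumptions.

```python
import numpy as np

def train_semisup_nb(X_l, y_l, X_u, n_classes, n_iters=20,
                     temp_schedule=None, alpha=1.0):
    """Semi-supervised multinomial naive Bayes trained with EM.

    X_l: (n_labeled, V) word-count matrix for labeled documents
    y_l: (n_labeled,) integer class labels
    X_u: (n_unlabeled, V) word counts for unlabeled documents
    temp_schedule: optional list of temperatures T for deterministic
        annealing; the unnormalized log-posterior is divided by T
        before normalization, so large T gives near-uniform posteriors.
    alpha: Laplace smoothing for the word multinomials.
    """
    V = X_l.shape[1]
    # Hard (one-hot) responsibilities for the labeled documents.
    resp_l = np.zeros((len(y_l), n_classes))
    resp_l[np.arange(len(y_l)), y_l] = 1.0

    def m_step(resp_u):
        # Re-estimate class priors and word multinomials from the
        # combined (labeled + fractionally-labeled) corpus.
        resp = np.vstack([resp_l, resp_u])
        X = np.vstack([X_l, X_u])
        priors = (resp.sum(axis=0) + 1.0) / (resp.sum() + n_classes)
        word_counts = resp.T @ X                       # (n_classes, V)
        theta = (word_counts + alpha) / (
            word_counts.sum(axis=1, keepdims=True) + alpha * V)
        return np.log(priors), np.log(theta)

    def e_step(log_prior, log_theta, X, T=1.0):
        # Class posteriors P(c | d) under the current model, with an
        # annealing temperature applied to the log-posterior.
        log_post = log_prior + X @ log_theta.T
        log_post = log_post / T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Initialize from the supervised model (unlabeled docs weighted 0).
    log_prior, log_theta = m_step(np.zeros((X_u.shape[0], n_classes)))
    temps = temp_schedule if temp_schedule is not None else [1.0] * n_iters
    for T in temps:
        resp_u = e_step(log_prior, log_theta, X_u, T=T)
        log_prior, log_theta = m_step(resp_u)
    return log_prior, log_theta

def predict(log_prior, log_theta, X):
    return np.argmax(log_prior + X @ log_theta.T, axis=1)

# Toy corpus: vocabulary of 4 words, classes 0 ("sports") and 1
# ("politics"); labeled set has one document per class.
X_l = np.array([[3, 2, 0, 0], [0, 0, 2, 3]], dtype=float)
y_l = np.array([0, 1])
X_u = np.array([[2, 1, 0, 0], [1, 2, 0, 0],
                [0, 0, 1, 2], [0, 0, 3, 1]], dtype=float)

lp, lt = train_semisup_nb(X_l, y_l, X_u, n_classes=2,
                          temp_schedule=[10.0, 5.0, 2.0, 1.0])
print(predict(lp, lt, X_u))  # expected cluster assignment [0 0 1 1]
```

On this separable toy problem plain EM and the annealed schedule agree; the chapter's point is that on high-dimensional real text, starting at high temperature and cooling can avoid poor local maxima that plain EM falls into.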


Similar articles

A Comparison of Discriminative EM-Based Semi-Supervised Learning Algorithms on Agreement/Disagreement Classification

Recently, semi-supervised learning has been an active research topic in the natural language processing community, to save effort in hand-labeling for data-driven learning and to exploit a large amount of readily available unlabeled text. In this paper, we apply EM-based semi-supervised learning algorithms such as traditional EM, co-EM, and cross validation EM to the task of agreement/disagreem...


Large Scale Text Classification using Semisupervised Multinomial Naive Bayes

Numerous semi-supervised learning methods have been proposed to augment Multinomial Naive Bayes (MNB) using unlabeled documents, but their use in practice is often limited due to implementation difficulty, inconsistent prediction performance, or high computational cost. In this paper, we propose a new, very simple semi-supervised extension of MNB, called Semi-supervised Frequency Estimate (SFE)...


Semi-supervised Learning Approach for Automatic Emotional Expression Extraction from eBook Text

We have developed an approach for the automatic extraction of emotion expression from text data of ebooks, such as novels and short stories. The embedding of the extraction results as metadata allows a text-to-speech system to enable the expressive reading of these texts along with the selection of a dictionary of voices associated with emotions. As a text prefilter for the automatic extraction...


Cross Language Text Classification by Model Translation and Semi-Supervised Learning

In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a ...


Combining Labeled and Unlabeled Data for MultiClass Text Categorization

Supervised learning techniques for text classification often require a large number of labeled examples to learn accurately. One way to reduce the amount of labeled data required is to develop algorithms that can learn effectively from a small number of labeled examples augmented with a large number of unlabeled examples. Current text learning techniques for combining labeled and unlabeled data, such ...



Publication date: 2006